Abstract


How does the brain recognize three-dimensional objects? An initial step towards the understanding of the
neural substrate of visual object recognition can be taken by studying first the nature of object representation,
as manifested in behavioral studies with humans or non-human primates. One fundamental question
is whether these representations are object or viewer centered. We trained monkeys to recognize computer
rendered objects presented from an arbitrarily chosen training view, and subsequently tested their ability to
generalize recognition for views generated by mathematically rotating the objects around any arbitrary axis.
In agreement with human psychophysical work (Rock and DiVita, 1987; Bülthoff and Edelman, 1992), our
results show that recognition at the subordinate level becomes increasingly difficult for the monkey as the
stimulus is rotated away from a familiar attitude, and thus provide additional evidence in favor of memorial
representations that are viewer-centered. When the animals were trained with as few as three views of the
object, 120° apart, they could often interpolate recognition for all views resulting from rotations around the
same axis. The possibility thus exists that even in the case of a viewer-centered recognition system, a small
number of stored views may suffice to achieve the view-invariant performance that humans and non-human
primates typically achieve when recognizing familiar objects. These results are also in agreement with a
recognition model that accomplishes view-invariant performance by storing a limited number of object views
or templates together with the capacity to interpolate between the templates (Poggio and Edelman, 1990).
In such a model, the units involved in representing a learned view are expected to exhibit a bell-shaped
tuning curve centered around the learned view, while interpolation is instantiated in the summed activity
of the units.
1  Introduction


Most theories of object recognition assume that the visual
system stores a representation of an object and
that recognition occurs when this stored representation
is matched to its corresponding sensory representation
generated from the viewed object [28]. What is, however,
the nature of these representations, what is stored
in memory, and how is matching achieved? A space of
possible representations could be characterized by addressing
the issues of (1) the recognition task, (2) the
attributes to be represented, (3) the nature of primitives
that would describe these attributes, and (4) the spatial
reference frame with respect to which the object is defined.

Representations may vary for different recognition
tasks. A fundamental task for any recognition system
is to cut up the environment into categories the members
of which, although nonidentical, are conceived of
as equivalent. Such categories often relate to each other
by means of class inclusion, forming taxonomies. Objects
are usually recognized first at a particular level of
abstraction, called the basic level [25]. For example, a
golden retriever is more likely to be first perceived as
a dog, rather than as a retriever or a mammal. Classifications
at the basic level carry the highest amount of
information about a category and are usually characterized
by distinct shapes [25]. Classifications above the
basic level, superordinate categories, are more general,
while those below the basic level, subordinate categories,
are more specific, sharing a great number of attributes
with other subordinate categories, and having to a large
extent similar shape (for a thorough discussion of categories
see [8,24,25]). Representations of objects at different
taxonomic levels may differ in their attributes, the
nature of primitives describing various attributes, and
the reference frame used for the description of the object.

In primate vision, shape seems to be the critical attribute
for object recognition. Material properties, such
as color or texture, may be important primarily at the
most subordinate levels. Recognition of objects is typically
unaffected in gray-scale photographs, line drawings,
or in cartoons with wrong color and texture information.
An elephant, for instance, would be recognized as an elephant,
even if it were painted yellow and textured with
blue spots. Evidence as to the importance of shape for
object perception comes also from clinical studies showing
that the breakdown of recognition, resulting from
circumscribed damage to the human cerebral cortex, is
most marked at the subordinate level, at which the greatest
shape similarities occur [5].

Models of recognition differ in the spatial frame
used for shape representation. Current theories using
object-centered representations assume either a complete
three-dimensional description of an object [28], or
a structural description of the image specifying the relationships
among viewpoint-invariant volumetric primitives
[1,12]. In contrast, viewer-centered representations
model three-dimensional objects as a set of 2D views,
or aspects, and recognition consists of matching image
features against the views in this set.

When tested against human behavior, object-centered
representations predict well the view-independent recognition
of familiar objects [1]. However, psychophysical
studies using familiar objects to investigate the
processes underlying object constancy, i.e. viewpoint-invariant
recognition of objects, can be misleading because
a recognition system based on 3D descriptions cannot
easily be discerned from a viewer-centered system
exposed to a sufficient number of object views. Furthermore,
object-centered representations fail to account for
performance in recognition tasks with various kinds of
novel objects at the subordinate level [4,6,18,19,27].

Viewer-centered representations, on the other hand,
can account for recognition performance at any taxonomic
level, but they have often been considered implausible
due to the vast amount of memory required
to store all discriminable object views needed to achieve
viewpoint invariance. Yet, recent theoretical work shows
that a simple network can achieve viewpoint invariance
by interpolating between a small number of stored views
[16]. Computationally, this network uses a small set of
sparse data corresponding to an object's training views
to synthesize an approximation to a multivariate function
representing the object. The approximation technique
is known by the name of Generalized Radial Basis
Functions (GRBFs), and it has been shown to be mathematically
equivalent to a multilayer network [17]. A
special case of such a network is that of the Radial Basis
Functions (RBFs) that can be conceived of as “hidden-layer”
units, the activity of which is a radial function of
the disparity between a novel view and a template stored
in the unit’s memory. Such an interpolation-based network
makes both psychophysical and physiological predictions
[15] that can be directly tested against behavioral
performance and single cell activity.

In the experiments described below, we trained monkeys
to recognize novel objects presented from one view,
and subsequently tested their ability to generalize recognition
for views generated by mathematically rotating
the objects around arbitrary axes. The stimuli, examples
of which are shown in Figure 1, were similar to
those used by Edelman and Bülthoff (1992) [6] in human
psychophysical experiments. Our aim was to examine
whether non-human primates show viewpoint invariance
at the subordinate level of recognition. Brief
reports of these experiments have been published previously
[10,11].

2  Materials  and  Methods 

2.1  Subjects  and  Surgical  Procedures 

Three juvenile rhesus monkeys (Macaca mulatta) weighing
7-9 kg were tested. The animals were cared for in
accordance with the National Institutes of Health Guide,
and the guidelines of the Animal Protocol Review Committee
of the Baylor College of Medicine.

The animals underwent surgery for the placement
of a head restraint post, and a scleral-search eye coil
[9] for measuring eye movements. The monkeys were
given antibiotics (Tribrissen 30 mg/kg) and analgesics
(Tylenol 10 mg/kg) orally one day before the operation.
The surgical procedure was carried out under strictly
aseptic conditions while the animals were anesthetized
with isoflurane (induction 3.5% and maintenance 1.2%
- 1.5%, at 0.8 L/min oxygen). Throughout the surgical
procedure the animals received 5% dextrose in lactated
Ringer’s solution at a rate of 15 ml/kg/hr. Heart
rate, blood pressure and respiration were monitored constantly
and recorded every 15 minutes. Body temperature
was kept at 37.4 degrees Celsius using a heating
pad. Postoperatively, an opioid analgesic was administered
(Buprenorphine hydrochloride 0.02 mg/kg, IM)
every 6 hours for one day. Tylenol (10 mg/kg) and antibiotics
(Tribrissen 30 mg/kg) were given to the animal
for 3-5 days after the operation.

2.2  Animal  Training 

Standard operant conditioning techniques with positive
reinforcement were used to train the monkey to perform
the task. Initially, the animals were trained to recognize
the target's zero view among a large set of distractors,
and subsequently were trained to recognize additional
target views resulting from progressively larger rotations
around one axis. After the monkey learned to recognize
a given object from any viewpoint in the range of
±90°, the procedure was repeated with a new object. In
the early stages of training several days were required
to train the animals to perform the same task for a new
object. Four months of training were required on average
for the monkey to learn to generalize the task across different
types of objects of one class, and about six months
were required for the animal to generalize across different
object classes.

Within an object class the similarity of the targets
to the distractors was gradually increased, and in the final
stage of the experiments distractor wire-objects were
generated by adding different degrees of positional or orientation
noise to the target objects. A criterion of 95%
correct for several objects was required to proceed with
the psychophysical data collection.

In the early phase of the animal's training a reward
followed each correct response. In the later stages of the
training the animals were reinforced on a variable-ratio
schedule which administered a reward after a specified
average number of correct responses had been given. Finally,
in the last stage of the behavioral training the
monkey was rewarded only after ten consecutive correct
responses. The end of the observation period was signalled
with a full-screen, green light and a juice reward
for the monkey.

During the behavioral training, independent of the reinforcement
schedule, the monkey always received feedback
as to the correctness of its response. One incorrect
report aborted the entire observation period. During the
psychophysical data collection, on the other hand, the
monkey was presented with novel objects and no feedback
was given during the testing period. The behavior
of the animals was continuously monitored during
the data collection by computing on-line hit rate and
false alarms. To discourage arbitrary performance or
the development of hand-preferences, e.g. giving only
right hand responses, sessions of data collection were
randomly interleaved with sessions with novel objects,
in which incorrect responses aborted the trial.

2.3  Visual  Stimuli 

Wire-like and spheroidal objects were generated mathematically
and presented on a color monitor (Figure 1).

The selection of the vertices of the wire objects within
a three-dimensional space was constrained to exclude
intersection of the wire-segments and extremely sharp
angles between successive segments, and to ensure that
the difference in the moment of inertia between different
wires remained within a limit of 10%. Once the vertices
were selected the wire objects were generated by determining
a set of rectangular facets covering a hypothetical
surface of a tube of a given radius that joined successive
vertices.
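
The vertex selection described above can be illustrated with a minimal rejection-sampling sketch. It enforces only the sharp-angle constraint; the intersection and moment-of-inertia checks are omitted. The function names, the 30 degree threshold, the vertex count, and the unit cube sampling volume are all assumptions made for illustration, not values taken from the experiments.

```python
import math
import random


def interior_angle(a, b, c):
    """Angle in degrees at vertex b between the segments b->a and b->c."""
    u = [a[i] - b[i] for i in range(3)]
    v = [c[i] - b[i] for i in range(3)]
    nu = math.sqrt(sum(x * x for x in u))
    nv = math.sqrt(sum(x * x for x in v))
    if nu == 0.0 or nv == 0.0:
        return 0.0  # degenerate segment: treat as maximally sharp
    cosang = sum(x * y for x, y in zip(u, v)) / (nu * nv)
    cosang = max(-1.0, min(1.0, cosang))  # guard against rounding error
    return math.degrees(math.acos(cosang))


def random_wire(n_vertices=8, min_angle_deg=30.0, seed=None, max_tries=10000):
    """Draw vertices in a cube, rejecting candidates that form a sharp angle.

    Assumption: only the angle constraint is checked here.
    """
    rng = random.Random(seed)
    verts = [tuple(rng.uniform(-1.0, 1.0) for _ in range(3)) for _ in range(2)]
    tries = 0
    while len(verts) < n_vertices:
        tries += 1
        if tries > max_tries:
            raise RuntimeError("could not satisfy the angle constraint")
        cand = tuple(rng.uniform(-1.0, 1.0) for _ in range(3))
        if interior_angle(verts[-2], verts[-1], cand) >= min_angle_deg:
            verts.append(cand)
    return verts


wire = random_wire(seed=1)  # eight vertices, i.e. seven segments
```

A fuller generator would add the two remaining rejection tests (segment intersection and moment of inertia) inside the same loop.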

The spheroidal objects were created through the generation
of a recursively-subdivided triangle mesh approximating
a sphere. Protrusions were generated by
randomly selecting a point on the sphere surface and
stretching it outward. Smoothness was accomplished by
increasing the number of triangles forming the polyhedron
that represents one protrusion. Spheroidal stimuli
were characterized by the number, sign (negative sign
corresponded to dimples), size, density and sigma of
the Gaussian type protrusions. Similarity was varied by
changing these parameters as well as the overall size of
the sphere.

3  Results 

3.1  Viewpoint-Dependent  Recognition 
Performance 

Three monkeys and two human subjects participated in
this experiment yielding similar results. Only the monkey
data are presented in this paper. The animals were
trained to recognize any given object viewed on one occasion
in one orientation, when presented on a second
occasion in a different orientation. Technically, this is
a typical recognition, “old-new” task, whereby the subject's
ability to retain stimuli to which it has been exposed
is tested by presenting those stimuli intermixed
with other objects never before encountered. The subject
is required to state for each stimulus whether it is
“old”, i.e. familiar, or “new”, i.e. never seen before. This
type of task is similar to the yes-no task of detection in
psychophysics and can be studied under the assumptions
of the signal detectability theory [7,13].

Figure 2a describes the sequence of events in a single
observation period. Successful fixation of a central light
spot was followed by the learning phase, during which
the monkeys were allowed to inspect an object, the target,
from a given viewpoint, arbitrarily called the zero
view. To provide the subject with 3D structure information,
the target was presented as a motion sequence
of 10 adjacent, Gouraud-shaded views, 2° apart, centered
around the zero view. The animation was accomplished
at a 2 frames-per-view temporal rate, i.e. each
view lasted 33.3 msec, yielding the impression of an object
oscillating slowly ±10° around a fixed axis.

The learning phase was followed by a short fixation
period after which the testing phase started. Each testing
phase consisted of up to 10 trials. The beginning
of a trial was indicated by a low-pitched tone, immediately
followed by the presentation of the test stimulus,
a shaded, static view of either the target or a distractor.
Target views were generated by rotating the object
around one of four axes, the vertical, the horizontal, the
right oblique, or the left oblique (Fig. 2b). Distractors
were other objects of the same or different class (Fig. 1).
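
As an illustration of how target views might be produced by rotating an object about one of the four axes, the sketch below applies Rodrigues' rotation formula to a list of 3D vertices. The coordinate convention (x to the right, y up, z toward the viewer) and the direction vectors assigned to the two oblique axes are assumptions, as are the names `AXES` and `target_view`.

```python
import math


def rotate_about_axis(point, axis, angle_deg):
    """Rotate a 3D point about an axis through the origin (Rodrigues' formula)."""
    ax, ay, az = axis
    norm = math.sqrt(ax * ax + ay * ay + az * az)
    kx, ky, kz = ax / norm, ay / norm, az / norm  # unit axis k
    theta = math.radians(angle_deg)
    c, s = math.cos(theta), math.sin(theta)
    px, py, pz = point
    # Cross product k x p
    cx = ky * pz - kz * py
    cy = kz * px - kx * pz
    cz = kx * py - ky * px
    dot = kx * px + ky * py + kz * pz  # k . p
    return (
        px * c + cx * s + kx * dot * (1.0 - c),
        py * c + cy * s + ky * dot * (1.0 - c),
        pz * c + cz * s + kz * dot * (1.0 - c),
    )


# Assumed axis directions in the image plane; the oblique axes lie at +/-45 degrees.
AXES = {
    "vertical": (0.0, 1.0, 0.0),
    "horizontal": (1.0, 0.0, 0.0),
    "right_oblique": (1.0, 1.0, 0.0),
    "left_oblique": (-1.0, 1.0, 0.0),
}


def target_view(vertices, axis_name, angle_deg):
    """Return the object's vertices after rotation about the named axis."""
    return [rotate_about_axis(v, AXES[axis_name], angle_deg) for v in vertices]


# Example: a 30 degree rotation of two vertices about the vertical axis.
rotated = target_view([(1.0, 0.0, 0.0), (0.0, 1.0, 0.5)], "vertical", 30.0)
```

Rotation preserves distances from the axis, so segment lengths of a wire object are unchanged across views; only their projections onto the image plane differ.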

Two levers were attached to the front panel of the
monkey chair, and reinforcement was contingent upon
pressing the right lever each time the target was presented.
Pressing the left lever was required upon presentation
of a distractor. Note (see Materials and Methods) that
no feedback was given to the animals during the psychophysical
data collection. A typical experimental session
consisted of a sequence of 60 observation periods,
each of which lasted about 25 seconds.

Figure 3a shows the performance of one of the monkeys
for rotations around the vertical axis. Thirty target
views and 60 distractor objects were used in this experiment.
On the abscissa of the graph we plot the rotation
angle and on the ordinate the experimental hit rate. The
small squares show performance for each tested view for
240 presentations. The solid line was obtained by a distance
weighted least squares smoothing of the data using
the McLain algorithm [14]. The small insets show examples
of the tested views. The monkey could identify
correctly the views of the target around the zero view,
while its performance dropped below chance levels for
disparities larger than 30 degrees for leftward rotations,
and larger than 60 degrees for rightward rotations. Performance
below chance level is probably the result of the
large number of distractors used within a session, which
limited learning of the distractors per se. Therefore an
object that was not perceived as a target view was readily
classified as a distractor.

Figure 3b shows the false alarm rate, that is, the percentage
of time that a distractor object was reported as
a view of the target. The abscissa shows the distractor
number, and the squares the false alarm rate for 20 presentations
of each distractor. Recognition performance
for rotations around the vertical, horizontal, and the two
oblique axes (±45°) can be seen in Figure 3c. The X and
Y axes on the bottom face of the plot show the rotations
in depth, and the Z axis the experimental hit rate.

To exclude the possibility that the observed view dependency
was specific to non-opaque structures lacking
extended surface, we have also tested recognition performance
using spheroidal, amoeba-like objects with characteristic
protrusions and concavities. Thirty-six views
of a target amoeba and 120 distractors were used in any
given session. As illustrated in Figure 4 the monkey
was able to generalize only for a limited number of novel
views clustered around the views presented in the training
phase. In contrast, performance was found to be
viewpoint-invariant when the animals were tested for basic
level classifications, or when they were trained with
multiple views of wire-like or amoeba-like objects. Figure
5 shows the mean performance of three monkeys for
each of the object classes tested. Each curve was generated
by averaging individual hit rate measurements obtained
from different animals for different objects within
a class. The data in Figure 5b were collected from three
monkeys using two spheroidal objects. The asymmetric
tuning curve denoting better recognition performance for
rightward rotations is probably due to asymmetric distribution
of characteristic protrusions in the two amoeboid
objects. Figure 5c shows the ability of monkeys
to recognize common objects, e.g. a teapot, presented
from various viewpoints. Distractors were other common
objects or simple geometrical shapes. Since all animals
were already trained to perform the task independent of the
object type used as a target, no familiarization with the
object’s zero view preceded the data collection in these
experiments. Yet, the animals could generalize recognition
for all tested novel views.

For some objects the subjects were better in their ability
to recognize the target from views resulting from
180 degree rotations. This type of behavior is evident
in Figure 6a for one of the monkeys. As can be seen
in the figure, performance drops for views farther than
30° but it resumes as the unfamiliar views of the target
approach the 180° view of the target. This behavior
was specific to those wire-like objects for which the zero
and 180° views appeared as mirror-symmetrical images
of each other, due to accidental minimal self-occlusion.
In this respect, the improvement in performance parallels
the reflectional invariance observed in human psychophysical
experiments [2]. Such reflectional invariance
may also partly explain the observation that information
about bilateral symmetry simplifies the task of 3D
recognition by reducing the number of views required to
achieve object constancy [30]. Not surprisingly, performance
around the 180 degree view of an object did not
improve for any of the opaque, spheroidal objects used
in these experiments.

3.2  Generalization  Field:  Simulations 

Poggio and Edelman (1990) described a regularization
network capable of performing view-independent recognition
of three-dimensional wire-like objects, after initial
training with a limited set of views of the objects [16].
The set size in their experiments, 80-100 views of an object
for the entire viewing sphere, predicts a generalization
field of about 30 degrees for any given rotation axis,
which is in agreement with human psychophysical work
[4,6,18,19], and with the data presented in this paper.

Figure 7 illustrates an example of such a network and
its output activity. A 2D view (Fig. 7a) can be represented
as a vector of some visible feature points on
the object. In the case of wire objects, these features
could be the x,y coordinates of the vertices, the orientation,
corners, size, length, texture and color of the
segments, or any other characteristic feature. In the example
of Figure 7b the input vector consists of seven
segment orientations. For simplicity we assume as many
basis functions as the views in the training set. Each
basis unit, U_i, in the “hidden-layer” calculates the distance
||V - T_i|| of the input vector V from its center T_i,
i.e. its learned or “preferred” view, and it subsequently
computes the function exp(-||V - T_i||) of this distance.
The value of this function is regarded as the activity of
the unit and it peaks when the input is the trained view
itself. The activity of the network is conceived of as
the weighted, linear sum of each unit’s output. In the
present simulations we assume that each unit's output
is superimposed on Gaussian noise, N(V, σ_n), the sigma
of which was estimated from single-unit data in the
inferotemporal cortex of the macaque monkey [11].
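
The computation performed by such a network can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the simulation code used for Figure 7: the template values, the equal weights, and the function names are hypothetical, and the basis function is written as exp(-||V - T_i||) to follow the text.

```python
import math
import random


def distance(v, t):
    """Euclidean distance ||V - T|| between an input view and a stored template."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(v, t)))


def unit_activity(v, template):
    """Activity of one basis unit: exp(-||V - T||), equal to 1 at the template."""
    return math.exp(-distance(v, template))


def network_output(v, templates, weights, noise_sd=0.0, rng=random):
    """Weighted linear sum of unit activities, each with optional Gaussian noise."""
    total = 0.0
    for template, weight in zip(templates, weights):
        activity = unit_activity(v, template)
        if noise_sd > 0.0:
            activity += rng.gauss(0.0, noise_sd)  # additive noise on the unit output
        total += weight * activity
    return total


# Hypothetical templates: seven segment orientations (radians) per stored view.
templates = [
    [0.1, 0.5, 1.2, 0.3, 0.9, 1.4, 0.2],
    [0.4, 0.8, 1.0, 0.6, 0.7, 1.1, 0.5],
]
weights = [1.0, 1.0]  # assumed equal weights, not fitted values

at_template = network_output(templates[0], templates, weights)
far_away = network_output([3.0] * 7, templates, weights)
```

With zero noise, `at_template` exceeds `far_away`, reflecting the bell-shaped fall-off of each unit's activity as the input departs from its preferred view.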

The four plots in Figure 7c show the output of each
RBF unit when presented with views generated by rotations
around the vertical axis. Units U_1 through U_4
are centered on the 0, 60, 120, and 180 degree views of
the object respectively. The abscissa of the plots shows
the rotation angle and the ordinate the unit’s output
normalized to its response at its center. Note the bell-shaped
response of each unit as the target object is rotated
away from its familiar attitude. The output of each
unit can be highly asymmetric around the center since
the independent variable in the plots (rotation angle) is
different from the argument of the exponential function.
Figure 7d shows the total activity of the network under
“zero” noise conditions. The thick, gray line on the left
plot illustrates the network’s output when the input is
any of the 36 tested target views. The right plot shows
its mean activity for any of the 36 views of each of the 60
distractors. The thick, black lines in Figures 7b, c, and d
show the representation and the activity of the same network
when trained with only the zero view, simulating
the actual psychophysical experiments described above.


To directly compare the network performance with the
psychophysical data described above we used the same
wire objects used in our first experiment (Generalization
Fields), and applied a decision theoretic analysis on the
network's output [7]. In Figure 8a the curve f_T(X), to
the right, represents the distribution of network activities
that occur on those occasions in which the input
is a view of the target. Accordingly, the curve f_D(X),
to the left, represents the distribution of activities when
the input is a given distractor. The abscissa of the graph
represents stimulus strength, which increases for increasing
familiarity of the object, that is for views nearer to
the trained view. Taken as an ideal observer's operation,
the network's decision to respond “old” (target) or
“new” (distractor) depends on an adopted decision criterion
X_c. The gray area on the right of X_c represents the
a posteriori probability of the network correctly identifying
a target, and it is denoted with P(T|T), while the
dark cross-hatched area on the right of X_c represents
the probability P(T|D) of a false alarm. On the left
of X_c, the area marked with horizontal lines gives the
probability of a correct rejection, and the area with vertical
lines represents the probability of failing to recognize
the target. As the cutoff point X_c runs through its possible
values, it generates a curvilinear relation between
P(T|T) and P(T|D) (Fig. 8b) known as the Receiver
Operating Characteristic (ROC) curve. The area underneath
this curve has been shown to amount to the
percentage correct performance of an ideal observer in
a two-alternative forced-choice (2AFC) task [7] (pages
45-47). In this model, performance depends solely on
the distance d' between the means of the f_T(X) and
f_D(X) distributions, revealing the actual sensitivity of
the recognition system. The distance d' is determined
in standard deviation units. A basic assumption in this
type of analysis is that the events leading to an “old” or
“new” response are normally distributed. Therefore, the
selection of the vertices of the wire-like objects was constrained
to ensure that the activity of the network across
the set of different distractors was distributed normally
(Fig. 8c).
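
The relation between the decision criterion, the ROC curve, and d' can be made concrete with a short sketch. It assumes two normal distributions with equal standard deviation (an assumption added here for simplicity); the function names and the numerical values of the means are illustrative only. Under that assumption the area under the ROC curve equals Φ(d'/√2), where Φ is the standard normal cumulative distribution.

```python
import math


def phi(z):
    """Standard normal cumulative distribution function."""
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))


def roc_points(mu_t, mu_d, sd, n=2001):
    """Sweep the criterion X_c and return (P(T|D), P(T|T)) pairs."""
    lo = min(mu_t, mu_d) - 6.0 * sd
    hi = max(mu_t, mu_d) + 6.0 * sd
    pts = []
    for i in range(n):
        xc = hi - (hi - lo) * i / (n - 1)  # from strict to lax criterion
        hit = 1.0 - phi((xc - mu_t) / sd)  # area of f_T to the right of X_c
        fa = 1.0 - phi((xc - mu_d) / sd)   # area of f_D to the right of X_c
        pts.append((fa, hit))
    return pts


def area_under_roc(pts):
    """Trapezoidal area under the ROC curve."""
    pts = sorted(pts)
    area = 0.0
    for (x0, y0), (x1, y1) in zip(pts, pts[1:]):
        area += (x1 - x0) * (y0 + y1) / 2.0
    return area


# Illustrative values: target and distractor means one standard deviation apart.
mu_t, mu_d, sd = 1.0, 0.0, 1.0
d_prime = (mu_t - mu_d) / sd
area = area_under_roc(roc_points(mu_t, mu_d, sd))
closed_form = phi(d_prime / math.sqrt(2.0))
```

The numerically integrated `area` and `closed_form` agree closely, which is the sense in which performance in this model depends only on d'.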


The white bars in Figure 9a show the distribution of
the network activity when the input was any of the 60
distractor wire objects. Black bars represent the activity
distribution for a given target view (-50, -30, 0, 30,
and 50 degrees). Complete ROC curves for views generated
by leftward and rightward rotations are illustrated
in Figures 9b and c respectively. Figure 9d shows the
performance of the network as an observer in a 2AFC
task. Open squares represent the area under the corresponding
ROC curve, and the gray, thick line shows
modeling of the data with a Gaussian function computed
using the Quasi-Newton minimization technique.


3.3  Generalization  Field:  Psychophysics 

The purpose of these experiments was to generate psychometric
curves that could be used for comparing the
psychophysical, physiological, and computational data
in the context of the above task. One way to generate
ROC curves in psychophysical experiments is to vary
the a priori probability of signal occurrence, and instruct
the observer to maximize the percentage of correct responses.
Since the training of the monkeys was designed
to maximize the animal’s correct responses, changing
the a priori probability of target occurrence did induce
a change in the animal's decision criterion, as is evident
in the variation of hits and false alarms in each curve of
Figures 10a and b.

The data were obtained by setting the a priori probability
of target occurrence in a block of observation periods
to 0.2, 0.4, 0.6, or 0.8. Figures 10a and b show
ROC curves for leftward and rightward rotations respectively.
Each curve is created from the four pairs of hit
and false alarm rates obtained for one given target view.
All target views were tested using the same set of distractors.
The percentage-correct performance of the monkey
is plotted in Figure 10c. Each filled circle represents the
area under the corresponding ROC curve in Figures 10a
and b. The thick, gray line shows modeling of the data
with a Gaussian function. Note the similarity between
the monkey’s performance and the simulated data (thin
gray line).
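
One simple way to estimate the area under an ROC curve defined by four (false alarm, hit) pairs is trapezoidal integration with the end points (0, 0) and (1, 1) added. The sketch below illustrates this; it is an assumption that a comparable procedure was used for Figure 10c, and the numerical rates are invented for illustration, not measured data.

```python
def empirical_roc_area(pairs):
    """Trapezoidal area under an ROC curve given (false_alarm, hit) pairs.

    The points (0, 0) and (1, 1) are added to close the curve.
    """
    pts = sorted([(0.0, 0.0)] + list(pairs) + [(1.0, 1.0)])
    area = 0.0
    for (x0, y0), (x1, y1) in zip(pts, pts[1:]):
        area += (x1 - x0) * (y0 + y1) / 2.0
    return area


# Invented example: one pair per a priori target probability (0.2, 0.4, 0.6, 0.8).
example_pairs = [(0.05, 0.55), (0.10, 0.70), (0.20, 0.85), (0.35, 0.95)]
area = empirical_roc_area(example_pairs)
```

An area of 0.5 corresponds to chance performance and an area of 1.0 to perfect discrimination of the target view from the distractors.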

3.4  Interpolation  between  two  trained  views 

A  network,  such  as  that  in  Figure  7,  represents  an  object 
by  a  set  of  2D  views,  the  templates,  and  when  the  ob¬ 
ject’s  attitude  changes,  the  network  generalizes  through 
nonlinear interpolation. In the simple case, in which
the number of basis functions is taken to be equal to the
number of views in the training set, interpolation depends
on the c_i and σ of the basis functions, and on the
disparity between the training views. Furthermore, unlike
schemes based on linear combination of 2D views [29],
the nonlinear interpolation model predicts that recognition
of novel views beyond the above measured generalization
field will occur only for those views situated between the
templates.
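
The prediction can be illustrated with a toy sketch in which a view is reduced to a single rotation angle and each stored template contributes a Gaussian basis function. Everything here is an assumption made for illustration: the one-dimensional view representation, the Gaussian form, the value of `sigma`, and the unit coefficients. The network of Figure 7 operates on feature vectors, not on angles.

```python
import math


def angular_distance(a, b):
    """Smallest absolute difference between two angles in degrees."""
    d = abs(a - b) % 360.0
    return min(d, 360.0 - d)


def rbf_output(view, templates, sigma=40.0, coeffs=None):
    """Weighted sum of Gaussian basis units centered on the templates.

    sigma and the default unit coefficients are hypothetical values.
    """
    if coeffs is None:
        coeffs = [1.0] * len(templates)
    return sum(
        c * math.exp(-angular_distance(view, t) ** 2 / (2.0 * sigma ** 2))
        for c, t in zip(coeffs, templates)
    )


templates = [0.0, 120.0]                  # two trained views
between = rbf_output(60.0, templates)     # view between the templates
outside = rbf_output(-60.0, templates)    # equally far from 0, but outside
print(between > outside)
```

Under these assumptions a view lying between the two templates receives support from both basis units, whereas a view at the same distance from one template but outside the pair is supported by only one unit.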

To  test  this  prediction  experimentally,  the  ability  of 
the  monkeys  to  generalize  recognition  to  novel  views 
was examined after training the animals with two successively
presented views of the target, 120° and 160°
apart.

The  results  of  such  an  experiment  are  illustrated  in 
Figures  11a  and  b.  The  monkey  was  initially  trained  to 
identify the 0° and 120° views of a wire-like object among
120 distractor objects of the same class. During this
period the animal was given feedback as to the correctness
of the response. Training was considered complete when
the monkey’s hit rate was consistently above 95%, the false
alarm rate remained below 10%, and the dispersion
coefficient of reaction times was minimized. A total of 600
presentations were required to achieve the above
conditions, after which testing and data collection began.

During  a  single  observation  period,  the  monkey  was 
first shown the familiar 0° and 120° views of the
object, and then presented sequentially with 10 stimuli that
could be either target or distractor views. Within one
experimental session each of the 36 tested target views
was presented 30 times. The spikes on the YZ plane of
the plot show the hit rate for each view generated by
rotations around the Y axis. The solid line represents a
distance-weighted, least-squares smoothing of the data
using the McLain algorithm [14]. The results show that
interpolation between familiar views may be the only
generalization achieved by the monkey’s recognition
system. No extrapolation is evident, with the exception of
the slightly increased hit rate for views around the -120°
view of the object, which approximately corresponds to a
180 degree rotation of some of the interpolated views.
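
The general idea of distance-weighted least-squares smoothing can be sketched as a local weighted line fit evaluated at each query angle. This is a generic illustration, not a reproduction of the McLain algorithm [14]: the Gaussian weight function, the `bandwidth` value, the first-order local model, and the example hit rates are all assumptions.

```python
import math


def dwls_smooth(xs, ys, x0, bandwidth=30.0):
    """Distance-weighted least-squares estimate of y at x0.

    Fits y = a + b * (x - x0) with weights that decay with the
    distance from x0 and returns the intercept a. The Gaussian
    weight and the bandwidth are assumptions for illustration.
    """
    sw = swx = swy = swxx = swxy = 0.0
    for x, y in zip(xs, ys):
        dx = x - x0
        w = math.exp(-dx * dx / (2.0 * bandwidth ** 2))
        sw += w
        swx += w * dx
        swy += w * y
        swxx += w * dx * dx
        swxy += w * dx * y
    denom = sw * swxx - swx * swx
    slope = (sw * swxy - swx * swy) / denom
    return (swy - slope * swx) / sw


# Hypothetical hit rates for 36 views in 10 degree steps
# (made-up values that peak between 0 and 120 degrees).
angles = list(range(-180, 180, 10))
hit_rates = [1.0 if 0 <= a <= 120 else 0.1 for a in angles]
smoothed = [dwls_smooth(angles, hit_rates, a) for a in angles]
print(len(smoothed))
```

Because nearby views receive larger weights than distant ones, the smoothed curve follows local trends in the hit rate while damping view-to-view scatter.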

The  contour  plot  summarizes  the  performance  of  the 
monkey  for  views  generated  by  rotating  the  object 
around  the  horizontal,  vertical,  and  the  two  oblique  axes. 
Thirty-six views were tested for each axis, each presented
30  times.  The  results  show  that  the  ability  of  the  monkey 
to  recognize  novel  views  is  limited  to  the  space  spanned 
between  the  two  trained  views  as  predicted  by  the  model 
of  nonlinear  approximation. 

The  experiment  was  repeated  after  briefly  training  the 
monkey to recognize the 60° view of the object.
During the second “training period” the animal was simply
given feedback as to the correctness of the response for
the 60° view of the object. The results can be seen in
Figure 11(b). The animal was able to recognize all views
between the 0° and 120° views. Moreover, performance
improved significantly around the -120° view.

4  Discussion 

The  main  findings  of  this  study  are  (a)  that  recogni¬ 
tion  of  a  novel,  three-dimensional  object  depends  on  the 
viewpoint from which the object is encountered, and (b)
that  perceptual  object-constancy  can  be  achieved  by  fa¬ 
miliarization  with  a  limited  number  of  views. 

The  first  demonstration  of  strong  viewpoint  depen¬ 
dence  in  the  recognition  of  novel  objects  was  that  of  Rock 
and  his  collaborators  [18,19].  These  investigators  exam¬ 
ined  the  ability  of  human  subjects  to  recognize  three- 
dimensional,  smoothly  curved  wire-like  objects  seen  from 
one  viewpoint,  when  encountered  from  a  different  atti¬ 
tude  and  thus  having  a  different  2D  projection  on  the 
retina.  Although  their  stimuli  were  real  objects  (made 
from  2.5mm  wire),  and  provided  the  subject  with  full 
3D  information,  there  was  a  sharp  drop  in  recognition 
for view disparities larger than approximately 30 degrees.
In fact, as subsequent investigations showed, subjects
could not even imagine how wire objects look when
rotated, despite instructions for visualizing the object from
another viewpoint [31]. Similar results were obtained in
later experiments by Edelman and Bülthoff (1992) with
computer-rendered, wire-like objects presented
stereoscopically or as flat images [4,6].

In  this  paper  we  provide  evidence  of  similar  view- 
dependency  of  recognition  for  the  nonhuman  primate. 
Monkeys  were  indeed  unable  to  recognize  objects  ro¬ 
tated  more  than  approximately  40  degrees  of  visual  angle 
from a familiar view. These results are hard to
reconcile with theories postulating object-centered
representations. Such theories predict uniform performance across
different object views, provided 3D information is
available to the subject at the time of the first encounter.
Therefore, one question calling for discussion is whether
or not information about the object’s structure was
available to the monkeys during the learning phase of these
experiments.

First  of  all,  wires  are  visible  in  their  entirety  since, 
unlike  most  opaque  natural  objects  in  the  environment, 
regions  in  front  do  not  substantially  occlude  regions  in 
back. Second, the objects were computer-rendered with
appropriate shading and were presented in slow oscillatory
motion. The motion parallax effects produced by
such  motion  yield  vivid  and  accurate  perception  of  the 
3D  structure  of  an  object  or  surface  [3,20].  In  fact,  psy¬ 
chometric  functions  showing  depth  modulation  thresh¬ 
olds  as  a  function  of  spatial  frequency  of  3D  corruga¬ 
tions  are  very  similar  for  surfaces  specified  through  ei¬ 
ther  disparity  or  motion  parallax  cues  [21-23].  Further¬ 
more,  experiments  on  monkeys  have  shown  that  nonhu¬ 
man  primates,  too,  possess  the  ability  to  see  structure 
from  motion  [26]  in  random-dot  kinematograms.  Thus, 
during  the  learning  phase  of  each  observation  period,  in¬ 
formation  about  the  three-dimensional  structure  of  the 
target  was  available  to  the  monkey  by  virtue  of  shading, 
the  kinetic  depth  effect,  and  minimal  self-occlusion. 

Could  the  view-dependent  behavior  of  the  animals  be 
a  result  of  the  monkeys'  failing  to  understand  the  task? 
The  monkey  could  indeed  recognize  a  two-dimensional 
pattern  as  such,  without  necessarily  perceiving  it  as  a 
view of an object. Correct performance around the
familiar view could then be simply explained as the
inability of the animal to discriminate adjacent views. Several
lines of argument refute such an interpretation of the
obtained results. For one, the animals easily generalized
recognition  to  all  novel  views  of  common  objects.  More¬ 
over,  when  the  wire-like  objects  had  prominent  charac¬ 
teristics,  such  as  one  or  more  sharp  angles,  or  a  closure, 
the  monkeys  were  able  to  perform  in  a  view-invariant 
fashion.  Second,  when  two  views  of  the  target  were  pre¬ 
sented  in  the  training  phase  the  animals  interpolated, 
often  with  100%  performance,  for  any  view  between  the 
two  trained  views. 

Third,  for  many  wire-like  objects  the  animal’s  recogni¬ 
tion  was  found  to  exceed  criterion  performance  for  views 
that resembled “mirror-symmetrical”, two-dimensional
images of each other, due to accidental lack of
self-occlusion. Invariance for reflections has been reported
earlier in the literature [2], and it clearly represents a
form of generalization. Finally, human subjects who
were  tested  for  comparison  using  the  same  apparatus 
exhibited  recognition  performance  very  similar  to  that 
of  the  tested  monkeys. 

Thus, it appears that monkeys, just like human
subjects, show rotational invariance for familiar, basic-level
objects, but they fail to generalize recognition at the
subordinate level, when fine, shape-based discriminations
are required to recognize an object. Interestingly,
training with a limited number of views (about 10 views for
the  entire  viewing  sphere)  was  sufficient  for  all  the  mon¬ 
keys  tested  to  achieve  view-independent  performance. 

Recognition based entirely on fine shape
discriminations is not uncommon in daily life. We are certainly able
to recognize modern sculptures, mountains, or cloud
formations. The largely view-independent, basic-level
recognition exhibited by adults may be the result of learning
certain irreducible shapes early in life. Even those
theories suggesting that recognition involves the indexing of
a  limited  number  of  volumetric  components  [1]  and  the 
detection  of  their  relationships  have  to  face  the  problem 
of  learning  components  that  cannot  be  further  decom¬ 
posed.  In  other  words,  we  still  have  to  achieve  represen¬ 
tations  of  some  elementary  object  forms  that  transcend 
the  special  viewpoint  of  the  observer.  Such  representa¬ 
tions  usually  rely  on  shape  coding  that  is  very  similar  to 
that  required  for  the  subordinate  level  of  recognition. 

5  Conclusions 

Our  results  provide  evidence  supporting  viewer-centered 
object representation in the primate, at least for
subordinate-level classifications. While monkeys, just like
human subjects, show rotational invariance for familiar,
basic-level objects, they fail to generalize recognition for
rotations of more than 30 to 40 degrees when fine,
shape-based discriminations are required to recognize an
object. The psychophysical performance of the animals is
consistent  with  the  idea  that  view-based  approximation 
modules  synthesized  during  training  may  indeed  be  one 
of  several  algorithms  the  primate  visual  system  uses  for 
object  recognition. 

The visual stimuli used in these experiments were
designed to provide accurate descriptions of the
three-dimensional structure of the objects. Therefore our
findings are unlikely to be the result of insufficient depth
information in the two-dimensional images for building
a three-dimensional representation. Rather, they suggest
that  construction  of  viewpoint-invariant  representations 
may not be possible for a novel object. Thus the
viewpoint-invariant performance typically observed when
recognizing familiar objects may eventually be the result of
a sufficient number of two-dimensional representations,
created for each experienced viewpoint. The number of
viewpoints is likely to depend on the class of an object
and may reach a minimum for novel objects that belong
to  a  familiar  class,  thereby  sharing  sufficiently  similar 
transformation  properties  with  the  other  class  members. 
Recognition  of  an  individual  new  face  seen  from  one  sin¬ 
gle  view  may  be  such  an  example. 

Figure 1: Examples of three stimulus objects used in the experiments on object recognition. (a) Wire-like, (b)
spheroidal, and (c) common objects were rendered by a computer and displayed on a color monitor. The middle column of
the “Targets” shows the view of each object as it appeared in the learning phase of an observation period. This view was
arbitrarily called the zero view of the object. Columns 1, 2, 4, and 5 show the views of each object when rotated -48, -24,
24, and 48 degrees about a vertical axis, respectively. The rightmost column shows an example of a distractor object for each
object class. Sixty to 120 distractor objects were used in each experiment.

Figure 2: Experimental paradigm. (a) Description of the task. An observation period consisted of a learning phase, within
which the target object was presented oscillating ±10° around a fixed axis, and a testing phase during which the subjects were
presented with up to 10 single, static views of either the target or the distractors. The small insets in this and the following
figures show examples of the tested views. The subject had to respond by pressing one of two levers, right for the target, and
left for the distractors. (b) Description of the stimulus space. The viewpoint coordinates of the observer with respect to the
object were defined as the longitude and the latitude of the eye on a virtual sphere centered on the object. Viewing the object
from an attitude a, e.g. -60° with respect to the zero view, corresponded to a 60° rightwards rotation of the object around
the vertical axis, while viewing from an attitude b amounted to a rightwards rotation around the -45° axis. Recognition was
tested for views generated by rotations around the vertical (Y), horizontal (X), and the two oblique (±45°) axes lying on the
XY plane.

Figure 3: Recognition performance as a function of rotation in depth for wire-like objects. Data from monkey
B63A. (a) The abscissa of the graph shows the rotation angle and the ordinate the hit rate. The small squares show
performance for each tested view for 240 presentations. The solid lines were obtained by a distance-weighted least-squares
smoothing of the data using the McLain algorithm. When the object is rotated more than about 30 to 40 degrees away,
performance falls below 40%. (b) False alarms for the 120 different distractor objects. The abscissa shows the distractor
number, and the squares the false alarm rate for 20 distractor presentations. (c) Recognition performance for rotations around
the vertical, horizontal, and the two oblique axes (±45°).

Figure 5: Mean recognition performance as a function of rotation in depth for different types of objects. (a) and
(b) show data averaged from three monkeys for the wire and spheroidal objects. (c) Performance of monkey S5396 for
common-type objects. Conventions as in Figure 3a.

Figure 6: Improvement of recognition performance for views generated by 180° rotations of wire-like objects. Data
from monkey S5396. Conventions as in Figure 3(a). This type of performance was specific to only those wire-like objects, the
zero and 180° views of which resembled mirror-symmetrical two-dimensional images due to accidental lack of self-occlusion.

Figure 7: A network for object recognition. (a) A view is represented as a vector of some visible feature points on the
object. On the wire objects these features could be the coordinates of the vertices, the orientation, size, length and color
of the segments, etc. (b) An example of an RBF network in which the input vector consists of the segment orientations. For
simplicity we assume as many basis functions as the views in the training set, in this example four views (0, 60, 120, and 180
degrees). Each basis unit, U_j, in the “hidden-layer” calculates the distance ||V - T_j|| of the input vector V from its center T_j,
i.e. its learned or “preferred” view, and it subsequently computes the function exp(-||V - T_j||) of this distance. The value of
this function is regarded as the activity of the unit, and it peaks when the input is the trained view itself. The activity of the
network is conceived as the weighted, linear sum of each unit’s output superimposed on Gaussian noise (ε ∈ N(ν, σ²)). Thick
lines show an instance of the network that was trained only with the zero view of the target. (c) Plots 1-4 show the output
of each RBF unit, under “zero-noise” conditions, when the unit is presented with views generated by rotations around the
vertical axis. (d) Network output for target and distractor views. The thick gray line on the left plot depicts the activity of
the network trained with four views, and the black line that of the network trained with one view (the zero view). The right
plot shows the network’s output for 36 views of 60 distractors.

Figure 8: Decision theoretic analysis of the network output. (a) The curve f_T(X), to the right, represents the distribution
of network activities that occur on those occasions when the input is a view of the target. The curve f_D(X), to the left,
represents the distribution of activities when the input is a given distractor. The network’s decision whether an input is a
target or a distractor depends on the decision criterion X_c. The gray area on the right of X_c represents the probability
P(T|T) of the network correctly identifying a target and the dark dotted area on the right of X_c represents the probability
P(T|D) of a false alarm. On the left of X_c, the area marked with horizontal lines gives the probability of correct rejections,
and the area with vertical lines represents the probability of failing to recognize a target. (b) As X_c runs through its possible
values it generates a curvilinear relation between P(T|T) and P(T|D) (thick black line), the area underneath which has been
shown to amount to the criterion-independent percentage-correct responses of an ideal observer in a 2AFC task. The latter
discriminability measure depends only on the distance d' between the distractor and target distributions. (c) Multiple normal
probability density functions can be approximated by a single Gaussian distribution, indicated by the thick gray line, when
the means of the distributions are separated by a fraction of the standard deviation.

Figure 9: Receiver operating characteristic (ROC) curves and performance of the RBF network. (a) White bars show
the distribution of the network activity when the input was any of the 60 distractor wire objects. Black bars represent the
activity distribution for a given target view (-50, -30, 0, 30, and 50 degrees). (b) Receiver operating characteristic curves for
views generated by leftward rotations. (c) Receiver operating characteristic curves for views generated by rightward rotations.
(d) Network performance as an observer in a 2AFC task. Filled squares represent the activity of the network. The solid line
is the distance-weighted least-squares smoothing of the data for all tested views. The dashed line shows chance performance.

Figure 10: ROC curves from one monkey in the old-new task used to study recognition. The data were obtained by
varying the a priori probability of target occurrence in a block of observation periods. The values used in this experiment were
0.2, 0.4, 0.6, and 0.8. (a) Each curve corresponds to a set of hit and false alarm rate values measured for a rightward rotation.
Rotations were done in 15° steps. (b) Same as in (a), but for leftward rotations. (c) Recognition performance for different
object views. Each filled circle represents the area under the corresponding ROC curve. The solid line models the data with
a single Gaussian function.

Figure 11: Interpolation between two trained views. (a) In the learning phase the monkey was presented sequentially with
the 0° and 120° views of a wire-like object, and subsequently tested with 36 views around any of the four axes (horizontal,
vertical and the two obliques). The spikes normal to the contour plot show the hit rate for rotations around the Y axis. Note
the somewhat increased hit rate for views around the -120° view. The contour plot shows the performance of the monkey for views
generated by rotating the object around any of the horizontal, vertical, and the two oblique axes. (b) Repetition of the
same experiment after briefly training the monkey with the 60° view of the wire object. The animal can now recognize any
view in the range of -30° to 140° as well as around the -120° view. As predicted by the RBF model, generalization is limited
to views between the two trained views.


